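The recommender model introduced here is meant for cases where the dataset is fairly small:
it uses the KNN (k-nearest neighbors) algorithm to make predictions.
If you have more data and more feature dimensions, you can search Kaggle for recommender systems that other people have built on the Netflix data: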
https://www.kaggle.com/netflix-inc/netflix-prize-data/code
The dataset used in this article is the Anime Recommendations Database from Kaggle:
https://www.kaggle.com/CooperUnion/anime-recommendations-database
Import the modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import seaborn as sns
Load the dataset into a pandas DataFrame
anime = pd.read_csv("anime.csv")
For certain categories the episode count is usually one
anime.loc[(anime["genre"]=="Hentai") & (anime["episodes"]=="Unknown"),"episodes"] = "1"
anime.loc[(anime["type"]=="OVA") & (anime["episodes"]=="Unknown"),"episodes"] = "1"
anime.loc[(anime["type"] == "Movie") & (anime["episodes"] == "Unknown"), "episodes"] = "1"
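A minimal sketch, on made-up data, of how a boolean mask combined with a column label in `.loc` limits an assignment to a single column. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up), mimicking the columns used in this article.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "type": ["Movie", "TV", "Movie"],
    "episodes": ["Unknown", "Unknown", "3"],
})

# Rows that are movies AND have an unknown episode count.
mask = (toy["type"] == "Movie") & (toy["episodes"] == "Unknown")

# Naming the column restricts the assignment to "episodes";
# without it, every column of the matching rows would be overwritten.
toy.loc[mask, "episodes"] = "1"

print(toy)
```

Only row 0 changes here, and its name and type are left untouched.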
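Remaining unknown episode counts: fill in the median
"Unknown" means the value was never filled in; convert the column to numbers (unknowns become NaN), then fill the gaps with the column median
anime["episodes"] = pd.to_numeric(anime["episodes"], errors="coerce")
anime["episodes"] = anime["episodes"].fillna(anime["episodes"].median())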
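The fill step can be checked on a tiny series. This is a sketch with made-up values, assuming the episode counts arrive as strings with "Unknown" marking missing entries.

```python
import pandas as pd

# Made-up episode counts stored as strings, as they are in a raw CSV column.
episodes = pd.Series(["12", "Unknown", "1", "24", "Unknown"])

# Anything that is not a number (here "Unknown") becomes NaN.
numeric = pd.to_numeric(episodes, errors="coerce")

# The median ignores NaN, so it is computed from 12, 1 and 24.
filled = numeric.fillna(numeric.median())

print(filled.tolist())
```

The median is less sensitive than the mean to a few very long-running series, which is one reason to prefer it for a skewed column like episode counts.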
Convert the rating feature to float and fill missing values with the median
Convert the members feature to float
anime["rating"] = anime["rating"].astype(float)
anime["rating"] = anime["rating"].fillna(anime["rating"].median())
anime["members"] = anime["members"].astype(float)
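One-hot encode the genre feature (and type), then combine them with the numeric features; genres in the file appear to be separated by a comma plus a space, so the separator includes the space
anime_features = pd.concat(
    [
        anime["genre"].str.get_dummies(sep=", "),
        pd.get_dummies(anime[["type"]]),
        anime[["rating"]],
        anime[["members"]],
        anime["episodes"],
    ],
    axis=1,
)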
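A small sketch of what `str.get_dummies` produces for multi-label strings, and why the separator matters. The genre strings below are made up, and the assumption is that entries are joined by a comma followed by a space; it is worth checking this on your own copy of the file.

```python
import pandas as pd

# Made-up genre strings in "comma + space" style.
genres = pd.Series(["Action, Comedy", "Comedy", "Action, Drama"])

# Splitting on "," alone keeps the leading space, so " Comedy" and
# "Comedy" end up as two different columns.
loose = genres.str.get_dummies(sep=",")

# Splitting on ", " gives one clean column per genre.
tight = genres.str.get_dummies(sep=", ")

print(list(loose.columns))
print(list(tight.columns))
print(tight)
```

Each row of `tight` has a 1 in every genre column that applies to that title, which is the one-hot (more precisely multi-hot) encoding that the distance computation works on.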
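Use MinMaxScaler to rescale every feature to the range [0, 1], so that large-valued features such as members do not dominate the distance computation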
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
anime_features = min_max_scaler.fit_transform(anime_features)
np.round(anime_features,decimals=2)
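To make the scaling step concrete, here is a sketch on a made-up matrix showing that MinMaxScaler with default settings computes (x - min) / (max - min) per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix: two columns on very different scales.
X = np.array([
    [1.0, 100.0],
    [5.0, 300.0],
    [9.0, 500.0],
])

scaled = MinMaxScaler().fit_transform(X)

# The same result computed by hand, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
```

After scaling, both columns span 0 to 1, so a difference in the second column no longer outweighs a difference in the first simply because its raw numbers are larger.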
Fit the nearest-neighbors model
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=4, algorithm='ball_tree').fit(anime_features)
distances, indices = nbrs.kneighbors(anime_features)
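indices is a 2-D array with one row per anime: the first element of a row is normally that anime's own row index (its distance to itself is zero), and the remaining elements are the row indices of the most similar (recommended) anime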
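The shape of the `kneighbors` output is easier to see on a handful of points. This is a sketch with made-up 2-D points; the names `points` and `nn` are only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up points forming two tight pairs.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [1.0, 0.9],
])

nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(points)
distances, indices = nn.kneighbors(points)

# Column 0 is each point itself (distance 0); column 1 is its closest other point.
print(indices)
print(distances)
```

Because the query points are the same points the model was fitted on, each point finds itself first. If two rows have identical features, the order of the tied entries is not guaranteed, so the first element is not always the query row.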
Define a helper function for lookups
def get_index_from_name(name):
    return anime[anime["name"] == name].index.tolist()[0]
Names of all anime
all_anime_names = list(anime.name.values)
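One possible use of a name list like this is a case-insensitive partial-title search, handy when the exact title is not known. The helper `find_names_containing` below is hypothetical (it is not part of the original code), and the sample list stands in for `all_anime_names`.

```python
def find_names_containing(partial, names):
    """Return every name that contains `partial`, ignoring case."""
    needle = partial.lower()
    # Skip non-string entries (for example NaN from a missing name).
    return [n for n in names if isinstance(n, str) and needle in n.lower()]


# Short sample list standing in for all_anime_names.
toy_names = ["Naruto", "Naruto: Shippuuden", "Bleach", "Boruto: Naruto the Movie"]

matches = find_names_containing("naruto", toy_names)
print(matches)
```

Any of the returned titles can then be passed as an exact name to a lookup by name.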
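Search for similar anime
def print_similar_animes(query=None, id=None):
    # Look up by row index; "is not None" keeps row index 0 valid
    if id is not None:
        for neighbor in indices[id][1:]:
            print(anime.loc[neighbor]["name"])
    # Look up by exact anime name
    if query:
        found_id = get_index_from_name(query)
        for neighbor in indices[found_id][1:]:
            print(anime.loc[neighbor]["name"])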
print_similar_animes(query="Naruto")
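A lookup by exact name raises an error when the title is not in the table, and printing makes the results hard to reuse. The sketch below shows one alternative that returns a DataFrame with distances and returns an empty result for unknown titles. Everything in it is an assumption for illustration: the function `similar_anime`, the toy names, and the toy feature matrix are not from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Made-up titles and 2-D features standing in for anime and anime_features.
toy_anime = pd.DataFrame({"name": ["Alpha", "Beta", "Gamma", "Delta"]})
toy_features = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [0.0, 0.3],
])
toy_nbrs = NearestNeighbors(n_neighbors=3).fit(toy_features)


def similar_anime(name, names_df, features, model):
    """Return the neighbors of `name` with their distances (hypothetical helper)."""
    matches = names_df.index[names_df["name"] == name]
    if len(matches) == 0:
        # Unknown title: return an empty table instead of raising.
        return pd.DataFrame(columns=["name", "distance"])
    pos = names_df.index.get_loc(matches[0])
    dist, idx = model.kneighbors(features[[pos]])
    result = pd.DataFrame({
        "name": names_df["name"].to_numpy()[idx[0]],
        "distance": dist[0],
    })
    # Drop the query title itself from its own recommendations.
    return result[result["name"] != name].reset_index(drop=True)


print(similar_anime("Alpha", toy_anime, toy_features, toy_nbrs))
```

Returning the distances alongside the names also makes it possible to filter out weak matches with a threshold.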